feat(gpu-crud): record gpu_mode and MIG strategy on every node-pool op#1233
Open
xuexu6666 wants to merge 9 commits into
Open
feat(gpu-crud): record gpu_mode and MIG strategy on every node-pool op#1233xuexu6666 wants to merge 9 commits into
xuexu6666 wants to merge 9 commits into
Conversation
GPU CRUD benchmark records could not reliably distinguish managed vs
fully-managed GPU pools: the stable azure-mgmt-containerservice SDK does
not model gpuProfile.nvidia.managementMode, so a fully-managed pool's mode
is silently dropped from the nodepool_info read-back, leaving only the pool
name (a brittle string) to tell them apart. MIG single vs mixed was also
dropped on scale ops, and scale_node_pool never even recorded the managed
flag.
Derive a normalized set of GPU metadata from the operation INPUT flags and
attach it to create, scale, and progressive-scale operations:
- gpu_mode: "none" | "managed" | "fully_managed"
- enable_managed_gpu, mig_enabled
- gpu_instance_profile, gpu_mig_strategy ("single" | "mixed")
Thread gpu_mig_strategy through the scale path (main.py -> NodePoolCRUD ->
AKSClient -> _progressive_scale), and pass the scale-path GPU flags through
execute.yml so scale-up/scale-down records are accurate in the pipeline.
Add unit tests for the metadata helper variants and for scale ops persisting
gpu_mode + MIG fields.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the fidelity of GPU CRUD benchmark records by deriving and persisting normalized GPU metadata (managed vs fully-managed mode and MIG configuration) from operation input flags, rather than relying on AKS SDK read-back fields that may be missing.
Changes:
- Add a normalized GPU metadata helper (
gpu_mode, MIG fields) and attach it to create/scale/progressive-scale operation metadata. - Thread
gpu_mig_strategythrough the Azure scale call chain (main.py → NodePoolCRUD → AKSClient → _progressive_scale). - Update the pipeline execution step to pass MIG strategy on scale operations, and add unit tests covering metadata normalization and scale metadata persistence.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| steps/engine/crud/k8s/execute.yml | Passes additional GPU/MIG CLI flags into create/scale steps so pipeline records can include them. |
| modules/python/tests/clients/test_aks_client.py | Adds unit tests for GPU metadata normalization and scale operation metadata persistence. |
| modules/python/crud/main.py | Threads gpu_mig_strategy into scale/all kwargs so the CLI can pass it into CRUD implementations. |
| modules/python/crud/azure/node_pool_crud.py | Extends Azure NodePoolCRUD signatures/forwarding to include gpu_mig_strategy. |
| modules/python/clients/aks_client.py | Introduces _gpu_mode_metadata and attaches it to operation metadata across create/scale/progressive-scale. |
Existing exact-call assertions in test_main.py and test_azure_node_pool_crud.py broke because scale_node_pool / all now receive gpu_mig_strategy. Add the new kwarg to the expected calls.
The helper is intentionally tested directly; disable W0212 for that test method.
…-access The def-line block disable did not suppress W0212 in CI; bind the protected helper to a single inline-disabled local and call that instead.
…t GPU metadata - main.py: gpu_instance_profile/gpu_mig_strategy are Azure-only MIG inputs; the AWS CRUD does not accept them (no **kwargs) and would raise TypeError. Only forward them when --cloud azure, via a conditional azure_gpu_kwargs dict. - aks_client._gpu_mode_metadata: normalize flag combinations so records are always internally consistent — enable_managed_gpu/MIG only apply to a GPU pool; MIG only applies to fully-managed pools (dropped otherwise); reject invalid gpu_mig_strategy values. - execute.yml: gate --gpu-instance-profile/--gpu-mig-strategy on ENABLE_MANAGED_GPU==true (create, scale-up, scale-down) so MIG flags are only emitted for fully-managed pools. - tests: set cloud="azure" on the GPU-kwarg cases, add an AWS regression test asserting MIG kwargs are omitted, and cover the new metadata normalization.
Log gpu_mode/enable_managed_gpu/mig_enabled/gpu_instance_profile/gpu_mig_strategy on each create/scale/progressive-scale op so the GPU mode is visible directly in pipeline logs (not only in the uploaded results.json). Skipped for non-GPU pools.
…s (C0302) The console-echo helper pushed the module to 1014 lines (limit 1000). Condense the _gpu_mode_metadata docstring and the gpu_mig_strategy ValueError message.
A100 MIG-single node provisioning can exceed the previous 1200s (20 min) scale poller timeout, causing spurious scale failures. Bump _begin_update_with_retry default timeout to 1800s (30 min).
…n nvidia-smi verify verify_nvidia_smi_on_node broke its readiness wait as soon as the nvidia.com/gpu key appeared, even with value 0. On a freshly-scaled MIG-single node the device plugin registers nvidia.com/gpu=0 before publishing the MIG instances, so the check raced in, read 0, and skipped the node with a misleading 'has no GPUs' warning. Wait for a positive GPU or MIG slice count instead of mere key presence. Tests: update no-GPU test for the new wait semantics; add a regression test where a node reports 0 then a positive count and must be verified (not skipped).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
GPU CRUD benchmark records could not reliably distinguish managed vs fully-managed GPU pools. The stable
azure-mgmt-containerserviceSDK does not modelgpuProfile.nvidia.managementMode, so a fully-managed pool's mode is silently dropped from thenodepool_inforead-back — leaving only the pool name (a brittle string) to tell them apart. On top of that:singlevsmixedwas dropped on scale operations.scale_node_poolnever recorded the managed flag at all.What this does
Derive a normalized set of GPU metadata from the operation input flags (not the lossy AKS read-back) and attach it to create, scale, and progressive-scale operations:
gpu_modenone|managed|fully_managedenable_managed_gpumig_enabledgpu_instance_profileMIG1g|nullgpu_mig_strategysingle|mixed|nullgpu_mig_strategythrough the scale path (main.py→NodePoolCRUD→AKSClient→_progressive_scale) and throughexecute.ymlso scale-up/scale-down records are accurate.GPU pool metadata: gpu_mode=… mig_enabled=… …) so it is visible in pipeline logs, not only in the uploadedresults.json.Robustness / review hardening
gpu_instance_profile/gpu_mig_strategyare Azure-only MIG inputs; the AWS CRUD does not accept them (no**kwargs). They are now only forwarded when--cloud azure, avoiding aTypeErroron AWS runs._gpu_mode_metadatanormalizes flag combinations so records are always internally consistent: managed/MIG only apply to a GPU pool; MIG only applies to fully-managed pools (dropped otherwise); invalidgpu_mig_strategyis rejected.execute.ymlgates--gpu-instance-profile/--gpu-mig-strategyonENABLE_MANAGED_GPU==true(create, scale-up, scale-down) so MIG flags are only emitted for fully-managed pools.Incidental fixes (folded in)
_begin_update_with_retry): A100 MIG node provisioning could exceed the old 1200s timeout, causing spurious scale failures.verify_nvidia_smi_on_nodeMIG-single race: the readiness wait broke as soon as thenvidia.com/gpukey appeared, even with value0. A freshly-scaled MIG-single node registersnvidia.com/gpu=0before publishing its MIG instances, so the check raced in, read0, and skipped the node with a misleading "has no GPUs" warning. Now waits for a positive GPU/MIG count.Verification
Validated end-to-end on the GPU Cluster CRUD pipeline — latest run build 71637 (commit
e07404fa, succeeded), A100 chainmanaged → fully_managed → MIG-mixed → MIG-single.Recorded
metadatain the publishedresults.jsonis correct on every create/scale_up/scale_down op (confirmed across builds 71550/71623/71637):managedFalseFalseNoneNonefully_managedTrueFalseNoneNonefully_managedTrueTrueMIG1gmixedfully_managedTrueTrueMIG1gsingleAlso confirmed in build 71637 logs:
GPU pool metadata: gpu_mode=managed enable_managed_gpu=False mig_enabled=False gpu_instance_profile=None gpu_mig_strategy=None.nvidia.com/gpu=0race is fixed —a100migsinis no longer skipped.managedvsfully_managedare now distinguishable, and MIGmixedvssingleis recorded on scale ops as well as create.Known / out-of-scope (not introduced by this PR)
Observed in build 71637 but unrelated to these changes:
southcentralus(ErrCode_InsufficientVCPUQuota, requested 192 / remaining 0) during ana100migmixedscale-up — environmental capacity, not a code issue.a100managed(driver-install) nodes did not advertisenvidia.com/gpuwithin the wait window, so nvidia-smi verification skipped them. This is pre-existing behavior (the prior code also waited and skipped, since the key never appears) — a device-plugin scheduling/timing issue on the managed A100 path, worth a separate follow-up.Testing
--cloud aws, and averify_nvidia_smi_on_noderegression test (node reports0then a positive count → verified, not skipped).